Vishant Bhatia (0798567), Tulaib Bin Ayyub (0789141), Alisha Mahajan (0802631)

We affirm that we are the authors of this project and that, while completing it, we followed St. Clair College’s policies on academic integrity.

R version used: R version 4.2.1 (2022-06-23 ucrt)

RStudio version used: 2022.07.1

List of R packages used: tidyverse, here, plotly, ggplot2.

Libraries imported and their versions
  • tidyverse : 1.3.2
  • here : 1.0.1
  • plotly : 4.10.0
  • ggplot2 : 3.3.6

Contribution of each member:

  • Prepared two plots displaying information about a single categorical variable.
  • Prepared one plot displaying information about both a continuous variable and a categorical variable.
  • Answered why data visualization is important to understanding a dataset.

  • Prepared two plots displaying information about a single continuous variable.
  • Prepared two plots displaying information that shows a relationship between two variables.
  • Answered why data visualization is important to communicating important aspects of a dataset.

  • Prepared one plot that uses faceting and displays information about 4 variables.
  • Prepared the competition plot.
  • Answered what role integrity plays for an analyst when creating a data visualization to communicate results to others.
  • Answered how many variables can be successfully represented in a visualization.

  • Dataset: Indian Election 2019

    https://www.kaggle.com/code/paramarthasengupta/eda-plotly-prediction-indian-elections-2019

    Datatypes and Variables
  • STATE: chr type
  • CONSTITUENCY: chr type
  • NAME: chr type
  • WINNER: int type
  • PARTY: chr type
  • SYMBOL: chr type
  • GENDER: chr type
  • CRIMINAL.CASES: chr type
  • AGE: int type
  • CATEGORY: chr type
  • EDUCATION: chr type
  • ASSETS: chr type
  • LIABILITIES: chr type
  • GENERAL.VOTES: int type
  • POSTAL.VOTES: int type
  • TOTAL.VOTES: int type
  • OVER.TOTAL.ELECTORS.IN.CONSTITUENCY: num type
  • OVER.TOTAL.VOTES.POLLED..IN.CONSTITUENCY: num type
  • TOTAL.ELECTORS: int type
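
The types listed above can be checked directly in R. A minimal sketch, assuming a small made-up data frame (`toy_df`, with invented values) standing in for the real election data:

```r
# Toy stand-in for the election data; all values are invented for illustration.
toy_df <- data.frame(
  STATE = c("Punjab", "Kerala"),
  WINNER = c(1L, 0L),
  CRIMINAL.CASES = c("2", "Not Available"),
  OVER.TOTAL.ELECTORS.IN.CONSTITUENCY = c(35.1, 12.4),
  stringsAsFactors = FALSE
)

# str() prints each column with its abbreviated type (chr, int, num).
str(toy_df)

# sapply(class) returns the same information as a named character vector.
col_types <- sapply(toy_df, class)
print(col_types)
```

Running `str()` on the imported data frame in the same way would show whether columns that hold numbers, such as CRIMINAL.CASES, were read in as text.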

    library("tidyverse")
    ## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
    ## ✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
    ## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
    ## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
    ## ✔ readr   2.1.3      ✔ forcats 0.5.2 
    ## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
    ## ✖ dplyr::filter() masks stats::filter()
    ## ✖ dplyr::lag()    masks stats::lag()
    library("here")
    ## here() starts at C:/F Drive/BAsic stats DAB 501/R programs
    library("plotly")
    ## 
    ## Attaching package: 'plotly'
    ## 
    ## The following object is masked from 'package:ggplot2':
    ## 
    ##     last_plot
    ## 
    ## The following object is masked from 'package:stats':
    ## 
    ##     filter
    ## 
    ## The following object is masked from 'package:graphics':
    ## 
    ##     layout
    library("ggplot2")

    IMPORTING DATA

    # here() resolves the file relative to the project root reported above.
    indian_df <- read.csv(here("Indian_Election_2019.csv"))

    1 Question Plot-1

    # Histogram of postal votes; each bin spans 250 votes and the y-axis counts rows (candidates).
    ggplot(data = indian_df, mapping = aes(x = POSTAL.VOTES)) +
      geom_histogram(binwidth = 250, color = "black", fill = "yellow") +
      labs(title = "DISTRIBUTION OF POSTAL VOTES", caption = "Data Source: Indian Election 2019", x = "Number of Postal Votes", y = "Number of Candidates") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"),
            plot.caption = element_text(colour = "red", size = 8, face = "bold", hjust = 1))

    Changes in Dataframe:

    The values were too large to read on the axes, so GENERAL.VOTES, TOTAL.VOTES and TOTAL.ELECTORS were divided by 10,000 (ten thousand) before plotting.

    # Rescale the vote and elector counts to units of ten thousand.
    indian_df <- indian_df %>%
      mutate(GENERAL.VOTES = GENERAL.VOTES / 10000,
             TOTAL.VOTES = TOTAL.VOTES / 10000,
             TOTAL.ELECTORS = TOTAL.ELECTORS / 10000)
    # Keep only the parties of interest.
    party_df <- indian_df %>% filter(PARTY %in% c("BJP","INC","AAP","SP","TDP","SAP", "MNM","CPI(ML)(L)","AIADMK","SAD(M)","DMDK","SAD","NOTA"))
    # CRIMINAL.CASES is stored as text; entries that are not numbers become NA.
    num_data <- transform(party_df, CRIMINAL.CASES = as.numeric(CRIMINAL.CASES))
    ## Warning in eval(substitute(list(...)), `_data`, parent.frame()): NAs introduced
    ## by coercion
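
The warning above comes from `as.numeric()` turning any string that is not a number into `NA`. A small sketch with made-up values (not taken from the dataset) shows the effect and one way to count the affected entries before plotting:

```r
# Hypothetical character column mixing numbers and a non-numeric entry.
cases_chr <- c("0", "3", "Not Available", "12")

# as.numeric() converts what it can and returns NA (with a warning) otherwise;
# suppressWarnings() only hides the message, the NA is still produced.
cases_num <- suppressWarnings(as.numeric(cases_chr))

# Count how many values were lost to coercion.
n_missing <- sum(is.na(cases_num))
print(cases_num)
print(n_missing)
```

Checking this count on the real column would show how many rows ggplot2 will later drop as non-finite values.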

    1 Question Plot-2

    tvotes_density <- ggplot(data = indian_df, mapping = aes(x = TOTAL.VOTES)) +
      # Scale the histogram to density so it shares a y-axis with the density curve.
      geom_histogram(aes(y = after_stat(density)), binwidth = 10, color = "red", fill = "blue") +
      geom_density(color = "#000000", fill = "#F85700", alpha = 0.6) +
      labs(x = "Total Votes (in Ten Thousands)", y = "Density", caption = "Data Source: Indian Election 2019", title = "DENSITY ACCORDING TO TOTAL VOTES") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"))

    ggplotly(tvotes_density)

    2 Question Plot-1

    ggplot(indian_df, mapping = aes(x = GENDER)) +
      # Rows with a blank GENDER value are shown under the label "Others".
      scale_x_discrete(breaks = c("MALE", "FEMALE", ""),
                       labels = c("Male", "Female", "Others")) +
      geom_bar(color = "red", fill = "yellow") +
      # Print each bar's count at the middle of the bar.
      geom_text(aes(label = after_stat(count)), stat = "count",
                position = position_stack(vjust = 0.5)) +
      labs(x = "Gender", y = "Count", caption = "Data Source: Indian Election 2019", title = "GENDER-WISE COUNT OF CANDIDATES IN INDIAN ELECTION 2019") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"),
            plot.caption = element_text(colour = "red", size = 8, face = "bold", hjust = 1))

    2 Question Plot-2

    ggplot(indian_df, mapping = aes(x = CATEGORY)) +
      # Rows with a blank CATEGORY value are shown under the label "Others".
      scale_x_discrete(breaks = c("ST", "", "SC", "GENERAL"),
                       labels = c("ST", "Others", "SC", "General")) +
      geom_bar(color = "black", fill = "cyan") +
      coord_flip() +
      geom_text(aes(label = after_stat(count)), stat = "count",
                position = position_stack(vjust = 0.5)) +
      labs(x = "Category", y = "Count", title = "No. of Candidates according to Category", caption = "Data Source: Indian Election 2019") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"),
            plot.caption = element_text(colour = "red", size = 8, face = "bold", hjust = 1))

    3 Question

    # One box per party showing the spread of criminal cases among its candidates.
    party_crime <- ggplot(num_data, aes(x = PARTY, y = CRIMINAL.CASES)) +
      geom_boxplot(aes(fill = PARTY), outlier.stroke = 2, color = "black", outlier.shape = 16, outlier.size = 1, notch = FALSE) +
      labs(title = "CRIMINAL CASES BY PARTY", x = "Party Name", y = "Criminal Cases", caption = "Data Source: Indian Election 2019") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"),
            plot.caption = element_text(colour = "red", size = 8, face = "bold", hjust = 1))
    
    party_crime
    ## Warning: Removed 248 rows containing non-finite values (stat_boxplot).

    4 Question Plot-1

    symb_party <- ggplot(party_df, aes(y = PARTY, x = SYMBOL)) +
      geom_point(color = "red") +
      labs(title = "SCATTER PLOT OF PARTIES AND THEIR SYMBOLS",
           caption = "Data Source: Indian Election 2019", y = "Party Name", x = "Party Symbol") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"))

    ggplotly(symb_party)

    4 Question Plot-2

    # geom_col() stacks the TOTAL.ELECTORS value of every row within each state.
    electoral_state <- ggplot(indian_df, aes(y = STATE, x = TOTAL.ELECTORS)) +
      geom_col(width = 0.9, fill = "pink") +
      labs(title = "TOTAL ELECTORS BY STATE", y = "States", x = "Total Electors (in Ten Thousands)") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"))

    ggplotly(electoral_state)
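
One caveat with this plot: if each row of the data is a candidate and TOTAL.ELECTORS is a constituency-level figure repeated on every candidate row (an assumption about the dataset layout, not something verified here), stacking rows by state counts the same electors several times. A hedged sketch with invented numbers that keeps one row per constituency before summing:

```r
library(dplyr)

# Invented example: two candidates per constituency, electors repeated on each row.
toy_df <- data.frame(
  STATE = c("A", "A", "A", "A", "B", "B"),
  CONSTITUENCY = c("A1", "A1", "A2", "A2", "B1", "B1"),
  TOTAL.ELECTORS = c(100, 100, 150, 150, 200, 200)
)

state_electors <- toy_df %>%
  distinct(STATE, CONSTITUENCY, TOTAL.ELECTORS) %>%  # one row per constituency
  group_by(STATE) %>%
  summarise(ELECTORS = sum(TOTAL.ELECTORS), .groups = "drop")

print(state_electors)
```

The summarised table (here called `state_electors`, a hypothetical name) could then be passed to the bar chart in place of the candidate-level data frame.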

    5 Question

    # Four variables: WINNER (x), PARTY (y), GENDER (fill) and CATEGORY (facets).
    ggplot(party_df, aes(x = WINNER, y = PARTY, fill = GENDER)) +
      geom_col() +
      facet_wrap(~CATEGORY, nrow = 2) +
      labs(title = "WINNERS BY PARTY AND GENDER, FACETED BY CATEGORY",
           caption = "Data Source: Indian Election 2019", x = "Winners", y = "Party Names") +
      theme(plot.title = element_text(color = "blue", size = 12, face = "bold"),
            plot.caption = element_text(colour = "red", size = 8, face = "bold", hjust = 1))

    6 Question

    point <- ggplot(party_df,mapping = aes(y=PARTY,x=STATE))+
      geom_point(aes(color=WINNER))+
      theme(axis.text.x = element_text(angle = 75, vjust = 0.6), axis.text.y = element_text(angle = -75, vjust = 0.4)) +
      labs(title = "WINNING CANDIDATES FROM PARTICULAR PARTY AND STATE", x = "State Name", y = "Party Name")+
      theme(plot.title = element_text(color="blue", size=12, face="bold"))
      
    ggplotly(point)

    Questions

    Q1. In what ways do you think data visualization is important to understanding a data set?

    Ans. Data visualization is important because it helps us understand a data set by highlighting its trends and outliers. It tells a story, strips away unwanted detail, and points out the useful information.

    Q2. In what ways do you think data visualization is important to communicating important aspects of a data set?

    Ans. Data visualization is very important for communicating important aspects of a data set because it can present both numerical and categorical data, which helps the audience understand the data properly and reduces the risk of misinterpretation. It conveys the context of the data and shows the relationships between the variables.

    Q3. What role does your integrity as an analyst play when creating a data visualization for communicating results to others?

    Ans. It is crucial for a data analyst to safeguard their data through security monitoring. The primary responsibility of a data integrity analyst is to keep proper records of the company’s data, to ensure that it is handled securely, and to keep it safe from unauthorised access. It is also their responsibility to check the authenticity of the data and to avoid copyright infringement.

    Q4. How many variables do you think you can successfully represent in a visualization? What happens when you exceed this number?

    Ans. To get a successful visualization, it is pivotal to plot enough variables to extract useful information. If we limit ourselves to a modest number of variables, say 6-7, we get a clear, understandable and efficient graph. On the contrary, if we plot too many variables with large amounts of data, the result is complex, hard to understand, and loses its real purpose.
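
As an illustration of the answer above, each variable in a plot needs its own visual channel (position, colour, panel and so on), which is why the count cannot grow indefinitely. A minimal sketch, assuming ggplot2's built-in `mpg` dataset as a stand-in rather than the election data, that shows four variables at once:

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here only as a stand-in dataset.
# Four variables, four channels: x position, y position, colour and facet panel.
four_var_plot <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  facet_wrap(~year) +
  labs(x = "Engine displacement (L)", y = "Highway mpg", color = "Drive")

four_var_plot
```

Adding further channels such as point size or shape is possible, but each extra mapping makes the legend longer and the plot harder to read.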